Class Prediction in Test Sets with Shifted Distributions
Abstract
Machine learning has provided powerful algorithms that automatically generate predictive models from experience. One specific technique is supervised learning, where the machine is trained to predict a desired output for each input pattern x. This chapter focuses on classification, that is, supervised learning in which the output to predict is a class label; an example is predicting whether a hospital patient will develop cancer. In this example, the class label c is a variable with two possible values, “cancer” or “no cancer”, and the input pattern x is a vector containing patient data (e.g. age, gender, diet, smoking habits). To construct a predictive model, supervised learning methods require a set of examples x_i together with their respective labels c_i. This dataset is called the “training set”. The constructed model is then used to predict the labels of a set of new cases x_j, called the “test set”. In the cancer prediction example, this is the phase in which the model is used to predict cancer in new patients. A common assumption in supervised learning algorithms is that the training and test datasets have the same statistical structure (Hastie, Tibshirani & Friedman, 2001). That is, the test set is assumed to have the same attribute distribution p(x) and the same class distribution p(c|x) as the training set. However, this is usually not the case in real applications, for several reasons. For instance, in many problems the training dataset is obtained in a specific manner that differs from the way the test dataset will be generated later. Moreover, the nature of the problem may evolve over time. These phenomena cause p_Tr(x, c) ≠ p_Test(x, c), which can degrade the performance of the model constructed in training. Here we present a new algorithm that re-estimates a model constructed in training using the unlabelled test patterns.
We show the convergence properties of the algorithm and illustrate its performance on an artificial problem. Finally, we demonstrate its strengths in a heart disease diagnosis problem where the training set is taken from a different hospital than the test set.
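The abstract does not spell out the proposed algorithm, so the sketch below does not reproduce it. To make the idea of re-estimating a trained model from unlabelled test patterns concrete, it shows a simpler, well-known related technique: an EM-style update of the class priors. It assumes that only p(c) changes between training and test while p(x|c) stays fixed, which is a narrower kind of shift than the general p_Tr(x, c) ≠ p_Test(x, c) case discussed above. The function names and example numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def adjust_posteriors(posteriors: Sequence[Sequence[float]],
                      train_priors: Sequence[float],
                      new_priors: Sequence[float]) -> List[List[float]]:
    """Rescale training-time posteriors p(c|x) to a new set of class priors."""
    adjusted = []
    for p in posteriors:
        # Reweight each class by new_prior / train_prior, then renormalise.
        weighted = [pk * new / old
                    for pk, new, old in zip(p, new_priors, train_priors)]
        total = sum(weighted)
        adjusted.append([w / total for w in weighted])
    return adjusted


def reestimate_priors(posteriors: Sequence[Sequence[float]],
                      train_priors: Sequence[float],
                      n_iter: int = 200,
                      tol: float = 1e-10) -> Tuple[List[float], List[List[float]]]:
    """EM re-estimation of class priors from unlabelled test posteriors.

    Assumption (not from the source): only the class priors shift.
    """
    priors = list(train_priors)
    for _ in range(n_iter):
        # E-step: posteriors under the current prior estimate.
        adjusted = adjust_posteriors(posteriors, train_priors, priors)
        # M-step: new prior = average adjusted posterior per class.
        updated = [sum(col) / len(adjusted) for col in zip(*adjusted)]
        converged = max(abs(u - q) for u, q in zip(updated, priors)) < tol
        priors = updated
        if converged:
            break
    return priors, adjust_posteriors(posteriors, train_priors, priors)


# Hypothetical classifier outputs on 10 unlabelled test cases, for a model
# trained with balanced priors over ("no cancer", "cancer").
example_posteriors = [[0.9, 0.1]] * 2 + [[0.2, 0.8]] * 8
new_priors, new_posteriors = reestimate_priors(example_posteriors, [0.5, 0.5])
```

In this toy input most test cases lean towards the second class, so the re-estimated prior for that class rises above its training value of 0.5 and the adjusted posteriors shift accordingly. No labels from the test set are used at any point.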
Related resources
Classification and properties of acyclic discrete phase-type distributions based on geometric and shifted geometric distributions
Acyclic phase-type distributions form a versatile model, serving as approximations to many probability distributions in various circumstances. They exhibit special properties and characteristics that usually make their applications attractive. Compared to acyclic continuous phase-type (ACPH) distributions, acyclic discrete phase-type (ADPH) distributions and their subclasses (ADPH family) have ...
Improving Minority Class Prediction Using Case-Specific Feature Weights
This paper addresses the problem of handling skewed class distributions within the case-based learning (CBL) framework. We first present as a baseline an information-gain-weighted CBL algorithm and apply it to three data sets from natural language processing (NLP) with skewed class distributions. Although overall performance of the baseline CBL algorithm is good, we show that the algorithm exhibi...
The Weighted Exponentiated Family of Distributions: Properties, Applications and Characterizations
In this paper a new method of introducing an additional parameter to a continuous distribution is proposed, which leads to a new class of distributions, called the weighted exponentiated family. A special sub-model is discussed. General expressions for some of the mathematical properties of this class such as the moments, quantile function, generating function and order statistics are derived;...
MMDT: Multi-Objective Memetic Rule Learning from Decision Tree
In this article, a Multi-Objective Memetic Algorithm (MA) for rule learning is proposed. Prediction accuracy and interpretation are two measures that conflict with each other. In this approach, we consider both the accuracy and the interpretation of rule sets. Additionally, individual classifiers face other problems such as huge sizes, high dimensionality and imbalanced class distributions in data sets. This...